Abstract

Error correction of sequenced reads remains a difficult task, especially in single-cell sequencing projects with 
extremely non-uniform coverage. While existing error correction tools designed for standard (multi-cell) sequencing 
data usually come up short in single-cell sequencing projects, algorithms actually used for single-cell error 
correction have been so far very simplistic. 

We introduce several novel algorithms based on Hamming graphs and Bayesian subclustering in our new error
correction tool BAYESHAMMER. While BAYESHAMMER was designed for single-cell sequencing, we demonstrate
that it also improves on existing error correction tools for multi-cell sequencing data while working much faster on
real-life datasets. We benchmark BAYESHAMMER on both k-mer counts and actual assembly results with the
SPADES genome assembler.



Background 

Single-cell sequencing [1,2] based on the Multiple Displacement Amplification (MDA) technology [1,3]
allows one to sequence genomes of important uncultivated bacteria that until recently had been
viewed as unamenable to genome sequencing. Existing metagenomic approaches (aimed at genes rather
than genomes) are clearly limited for studies of such bacteria, even though these bacteria
represent the majority of species in such important studies as the Human Microbiome Project [4,5]
or the discovery of new antibiotics-producing bacteria [6].

Single-cell sequencing datasets have extremely non-uniform coverage that may vary from single
digits to thousands along a single genome (Figure 1). For many existing error correction tools,
most notably QUAKE [7], uniform coverage is a prerequisite: in the case of non-uniform coverage
they either do not work or produce poor results.

Error correction tools usually attempt to correct the set of k-character substrings of reads,
called k-mers, and then propagate corrections to whole reads, which many assemblers require.
Error correction tools often employ a simple idea of discarding rare k-mers, which obviously does
not work in the case of non-uniform coverage.

Medvedev et al. [8] recently presented a new approach to error correction for datasets with
non-uniform coverage. Their algorithm HAMMER makes use of the Hamming graph (hence the name) on
k-mers (vertices of the graph correspond to k-mers and edges connect pairs of k-mers with Hamming
distance not exceeding a certain threshold). HAMMER employs a simple and fast clustering technique
based on selecting a central k-mer in each connected component of the Hamming graph. Such central
k-mers are assumed to be error-free (i.e., they are assumed to actually appear in the genome),
while the other k-mers from connected components are assumed to be erroneous instances of the
corresponding central k-mers. However, HAMMER may be overly simplistic: in connected components of
large diameter or connected components with several k-mers of large multiplicities, it is more
reasonable to assume that there are two or more central k-mers (rather than one as in HAMMER).
Biologically, such connected components may correspond to either (1) repeated regions with similar
but not identical genomic sequences (repeats) which would be bundled together by existing error
correction tools (including HAMMER); or (2) artificially united k-mers from distinct parts of the
genome that just happen to be connected by a path in the Hamming graph (characteristic of HAMMER).

Figure 1 Logarithmic coverage plot for the single-cell E. coli dataset (a similar plot is also
given in [2]).

In this paper, we introduce the BAYESHAMMER error correction tool that does not rely on uniform
coverage. BAYESHAMMER uses the clustering algorithm of HAMMER as a first step and then refines the
constructed clusters by further subclustering them with a procedure that takes into account read
quality values (e.g., provided by Illumina sequencing machines) and introduces Bayesian (BIC)
penalties for extra subclustering parameters. BAYESHAMMER subclustering aims to capture the
complex structure of repeats (possibly of varying coverage) in the genome by separating even very
similar k-mers that come from different instances of a repeat. BAYESHAMMER also uses a new approach
for propagating corrections in k-mers to corrections in the reads. All algorithms in BAYESHAMMER
are heavily parallelized whenever possible; as a result, BAYESHAMMER gains a significant speedup
with more processing cores available. These features make BAYESHAMMER well suited to error
correction in single-cell sequencing.

We remark that HAMMER produces only a set of central k-mers but does not correct reads, making it
incompatible with most genome assemblers. QUAKE does correct reads but has severe memory
limitations for large k and assumes uniform coverage. In contrast, EULER-SR [9] and CAMEL [2]
correct reads and do not make strong assumptions on coverage (both tools have been used for
single-cell assembly projects [2]), which makes these tools suitable for comparison to
BAYESHAMMER. Our benchmarks show that BAYESHAMMER outperforms these tools in both single-cell and
standard (multi-cell) modes. We further couple BAYESHAMMER with a recently developed genome
assembler SPADES [10] and demonstrate that assembly of BAYESHAMMER-corrected reads significantly
improves upon assembly with reads corrected by other tools for the same datasets, while the total
running time also improves significantly.

BAYESHAMMER is freely available for download as part of the SPADES genome assembler at
http://bioinf.spbau.ru/spades/.



Methods 

Notation and outline 

Let $\Sigma = \{A, C, G, T\}$ be the alphabet of nucleotides (BAYESHAMMER discards k-mers with
uncertain bases denoted N). A k-mer is an element of $\Sigma^k$, i.e., a string of k nucleotides.
We denote the $i$th letter (nucleotide) of a k-mer $x$ by $x[i]$, indexing them from zero:
$0 \le i \le k-1$. A subsequence of $x$ corresponding to a set of indices $I$ is denoted by
$x[I]$. We use interval notation $[i, j]$ for intervals of integers $\{i, i+1, \ldots, j\}$ and
further abbreviate $x[i, j] = x[\{i, i+1, \ldots, j\}]$; thus, $x = x[0, k-1]$. Input reads are
represented as a set of strings $R \subset \Sigma^*$ along with their quality values
$(q_r[i])_{i=0}^{|r|-1}$ for each $r \in R$. We assume that $q_r[i]$ estimates the probability
that there has been an error in position $i$ of read $r$. Notice that in practice, the fastq file
format [11] contains characters that encode probabilities on a logarithmic scale (in particular,
products of probabilities used below correspond to sums of actual quality values).
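The remark about the logarithmic scale can be illustrated with a minimal sketch. It assumes the common Phred+33 encoding, in which a character with code $Q + 33$ stands for the error probability $10^{-Q/10}$; older Illumina pipelines used an offset of 64, so the `offset` parameter is an assumption to check against the actual data. The function names are illustrative and are not taken from BAYESHAMMER.

```python
def phred_to_error_prob(quality_string, offset=33):
    """Convert a FASTQ quality string into per-base error probabilities.

    Assumes Phred scores Q = -10 * log10(p), encoded as chr(Q + offset).
    """
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]


def joint_error_prob(quality_strings, position, offset=33):
    """Product of error probabilities at one position across several reads.

    Because scores are logarithmic, the product of probabilities is
    obtained by summing the Phred scores first.
    """
    total_q = sum(ord(q[position]) - offset for q in quality_strings)
    return 10 ** (-total_q / 10)
```

For example, under Phred+33 the character `I` encodes Q = 40, i.e. an error probability of 0.0001.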

Below we give an overview of the BAYESHAMMER workflow (Figure 2) and refer to subsequent sections
for further details. On Step (1), k-mers in the reads are counted, producing a triple
$statistics(x) = (count_x, quality_x, error_x)$ for each k-mer $x$. Here, $count_x$ is the number
of times $x$ appears as a substring in the reads, $quality_x$ is its total quality expressed as a
probability of sequencing error in $x$, and $error_x$ is a k-dimensional vector that contains
products of error probabilities (sums of quality values) for individual nucleotides of $x$ across
all its occurrences in the reads. On Step (2), we find connected components of the Hamming graph
constructed from this set of k-mers. On Step (3), the connected components become subject to
Bayesian subclustering; as a result, for each k-mer we know the center of its subcluster. On
Step (4), we filter subcluster centers according to their total quality and form a set of solid
k-mers, which is then iteratively expanded on Step (5) by mapping them back to the reads. Step (6)
deals with reads correction by counting the majority vote of solid k-mers in each read. In the
iterative version, if there has been a substantial amount of changes in the reads, we run the next
iteration of error correction; otherwise, we output the corrected reads. Below we describe specific
algorithms employed in the BAYESHAMMER pipeline.

Figure 2 BAYESHAMMER workflow

Algorithms 

Step (1): computing k-mer statistics 

To collect k-mer statistics, we use a straightforward hash map approach [12] that does not require
storing instances of all k-mers in memory (an excessive amount of RAM might be needed otherwise).
For a certain positive integer N (the number of auxiliary files), we use a hash function
$h : \Sigma^k \to \mathbb{Z}_N$ that maps k-mers over the alphabet $\Sigma$ to integers from 0 to
N - 1.

Algorithm 1 Count k-mers

for each k-mer $x$ from the reads $R$ do
    compute $h(x)$ and write $x$ to $File_{h(x)}$
for $i \in [0, N-1]$ do
    sort $File_i$ with respect to the lexicographic order
    reading $File_i$ sequentially, compute $statistics(s)$ for each k-mer $s$ from $File_i$
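The bucketing idea of Algorithm 1 can be sketched as follows. This is a simplified illustration rather than the BAYESHAMMER implementation: in-memory lists stand in for the N auxiliary files, `zlib.crc32` is an assumed stand-in for the hash function $h$, and only the count component of the statistics is computed (quality and error vectors are omitted).

```python
from collections import defaultdict
from itertools import groupby
import zlib


def count_kmers(reads, k, num_buckets=4):
    """Count k-mers by hashing them into buckets, then sorting each bucket."""
    buckets = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" in kmer:  # k-mers with uncertain bases are discarded
                continue
            # Deterministic stand-in for h: k-mers -> {0, ..., N - 1}
            buckets[zlib.crc32(kmer.encode()) % num_buckets].append(kmer)

    counts = {}
    for bucket in buckets.values():
        bucket.sort()  # lexicographic order makes equal k-mers adjacent
        for kmer, group in groupby(bucket):
            counts[kmer] = sum(1 for _ in group)
    return counts
```

Since identical k-mers always hash to the same bucket, each bucket can be sorted and scanned independently, which is what makes the on-disk variant memory-friendly.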
Step (2): constructing connected components of Hamming graph

Step (2) is the essence of the HAMMER approach [8]. The Hamming distance between k-mers
$x, y \in \Sigma^k$ is the number of nucleotides in which they differ:

$$d(x, y) = |\{ i \in [0, k-1] : x[i] \neq y[i] \}|.$$

For a set of k-mers $X$, the Hamming graph $HG_\tau(X)$ is an undirected graph with the set of
vertices $X$ and edges corresponding to pairs of k-mers from $X$ with Hamming distance at most
$\tau$, i.e., $x, y \in X$ are connected by an edge in $HG_\tau(X)$ iff $d(x, y) \le \tau$
(Figure 3). To construct $HG_\tau(X)$ efficiently, we notice that if two k-mers are at Hamming
distance at most $\tau$, and we partition the set of indices $[0, k-1]$ into $\tau + 1$ parts, then
at least one part corresponds to the same subsequence in both k-mers. Below we assume with little
loss of generality that $\tau + 1$ divides $k$, i.e., $k = \sigma(\tau + 1)$ for some integer
$\sigma$.

For a subset of indices $I \subseteq [0, k-1]$, we define a partial lexicographic ordering
$\prec_I$ as follows: $x \prec_I y$ iff $x[I] \prec y[I]$, where $\prec$ is the lexicographic
ordering on $\Sigma^*$. Similarly, we define a partial equality $=_I$ such that $x =_I y$ iff
$x[I] = y[I]$. We partition the set of indices $[0, k-1]$ into $\tau + 1$ parts of size $\sigma$
and for each part $I$, sort a separate copy of $X$ with respect to $\prec_I$. As noticed above, for
every two k-mers $x, y \in X$ with $d(x, y) \le \tau$, there exists a part $I$ such that
$x =_I y$. It therefore suffices to separately consider blocks of equivalent k-mers with respect to
$=_I$ for each part $I$. If a block is small (i.e., of size smaller than a certain threshold), we
go over the pairs of k-mers in this block to find those with Hamming distance at most $\tau$. If a
block is large, we recursively apply to it the same procedure with a different partition of the
indices. In practice, we use two different partitions of $[0, k-1]$: the first corresponds to
contiguous subsets of indices (recall that $\sigma = \frac{k}{\tau+1}$):

$$I_s^{cnt} = \{ s\sigma, s\sigma + 1, \ldots, s\sigma + \sigma - 1 \}, \quad s = 0, \ldots, \tau,$$

while the second corresponds to strided subsets of indices:

$$I_s^{str} = \{ s, s + \tau + 1, s + 2(\tau + 1), \ldots, s + (\sigma - 1)(\tau + 1) \}, \quad s = 0, \ldots, \tau.$$

BAYESHAMMER uses a two-step procedure, first splitting with respect to
$\{I_s^{cnt}\}_{s=0}^{\tau}$ (Figure 4) and then, if an equivalence block is large, with respect to
$\{I_s^{str}\}_{s=0}^{\tau}$. On the block processing step, we use the disjoint set data structure
[12] to maintain the set of connected components. Step (2) is summarized in Algorithm 2.

Algorithm 2 Hamming graph processing

procedure HGPROCESS($X$, max_quadratic)
    Init components with singletons $\mathcal{X} = \{\{x\} : x \in X\}$.
    for all $Y \in$ FINDBLOCKS($X$, $\{I_s^{cnt}\}_{s=0}^{\tau}$) do
        if $|Y| >$ max_quadratic then
            for all $Z \in$ FINDBLOCKS($Y$, $\{I_s^{str}\}_{s=0}^{\tau}$) do
                PROCESSEXHAUSTIVELY($Z$, $\mathcal{X}$)
        else
            PROCESSEXHAUSTIVELY($Y$, $\mathcal{X}$)

function FINDBLOCKS($X$, $\{I_s\}_{s=0}^{\tau}$)
    for $s = 0, \ldots, \tau$ do
        sort a copy of $X$ with respect to $\prec_{I_s}$, getting a sorted copy
    for $s = 0, \ldots, \tau$ do
        output the set of equivalence blocks $\{Y\}$ of the $s$th sorted copy w.r.t. $=_{I_s}$

procedure PROCESSEXHAUSTIVELY($Y$, $\mathcal{X}$)
    for each pair $x, y \in Y$ do
        if $d(x, y) \le \tau$ then join their sets in $\mathcal{X}$:
            for all $x \in Z_x \in \mathcal{X}$, $y \in Z_y \in \mathcal{X}$ do
                $\mathcal{X} := \mathcal{X} \cup \{Z_x \cup Z_y\} \setminus \{Z_x, Z_y\}$
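The block-based construction can be sketched in a few lines. The sketch below is a simplification under stated assumptions: it uses only the contiguous partition (the second, strided splitting of large blocks is omitted), it groups k-mers with a dictionary instead of sorting copies of the set, and the function names are illustrative. The connected components are maintained with a small disjoint-set (union-find) structure.

```python
from collections import defaultdict
from itertools import combinations


def hamming(x, y):
    """Number of positions in which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))


def contiguous_parts(k, tau):
    """tau + 1 disjoint contiguous index ranges of size k // (tau + 1)."""
    sigma = k // (tau + 1)
    return [range(s * sigma, (s + 1) * sigma) for s in range(tau + 1)]


def hamming_components(kmers, tau):
    """Connected components of the Hamming graph with threshold tau."""
    kmers = sorted(set(kmers))
    if not kmers:
        return []
    parent = {x: x for x in kmers}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for part in contiguous_parts(len(kmers[0]), tau):
        # k-mers that agree on this part fall into the same block
        blocks = defaultdict(list)
        for x in kmers:
            blocks["".join(x[i] for i in part)].append(x)
        for block in blocks.values():
            for x, y in combinations(block, 2):
                if hamming(x, y) <= tau:
                    parent[find(x)] = find(y)  # join the two components

    components = defaultdict(set)
    for x in kmers:
        components[find(x)].add(x)
    return list(components.values())
```

Two k-mers at distance at most $\tau$ can disagree in at most $\tau$ of the $\tau + 1$ disjoint parts, so every edge of the graph is found inside at least one block; this is why pairwise comparison can be restricted to blocks.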
Figure 3 Hamming graphs $HG_1(X)$ and $HG_2(X)$ for $X$ being the set of 5-mers {ACGTG, CGTGT,
GTGTG, ACATG, CATGT, ATGTG, ACCTG, CCTGT, CTGTC} of the reads ACGTGTG, ACATGTG, ACCTGTC. Blue edges
denote Hamming distance 2.

Figure 4 Partial lexicographic orderings of a set $X$ of 9-mers with respect to the index sets
$I_0^{cnt} = \{0, 1, 2\}$, $I_1^{cnt} = \{3, 4, 5\}$, and $I_2^{cnt} = \{6, 7, 8\}$. Red dotted
lines indicate equivalence blocks.



Step (3): Bayesian subclustering

In HAMMER's generative model [8], it is assumed that errors in each position of a k-mer are
independent and occur with the same probability $\epsilon$, which is a fixed global parameter
(HAMMER used $\epsilon = 0.01$). Thus, the likelihood that a k-mer $x$ was generated from a k-mer
$y$ under HAMMER's model equals

$$L_H(x \mid y) = (1 - \epsilon)^{k - d(x, y)} \, \epsilon^{d(x, y)}.$$

Under this model, the maximum likelihood center of a cluster is simply its consensus string [8].

In BAYESHAMMER, we further elaborate upon HAMMER's model. Instead of a fixed $\epsilon$, we use
read quality values that approximate probabilities $q_x[i]$ of a nucleotide at position $i$ in the
k-mer $x$ being erroneous. We combine quality values from identical k-mers in the reads: for a
multiset of k-mers $X$ that agree on the $j$th nucleotide, it is erroneous with probability
$\prod_{x \in X} q_x[j]$.

The likelihood that a k-mer $x$ has been generated from another k-mer $c$ (under the independent
errors assumption) is given by

$$L(x \mid c) = \prod_{j : x[j] \neq c[j]} q_x[j] \prod_{j : x[j] = c[j]} (1 - q_x[j]),$$

and the likelihood of a specific subclustering $C = C_1 \cup \ldots \cup C_m$ is

$$L_m(C_1, \ldots, C_m) = \prod_{i=1}^{m} \prod_{x \in C_i} L(x \mid c_i),$$

where $c_i$ is the center (consensus string) of the subcluster $C_i$.
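The likelihood $L(x \mid c)$ is easiest to handle in log space, where the products become sums. The sketch below is illustrative only: `q` is assumed to be a list of per-position error probabilities strictly between 0 and 1, and the consensus is taken as an unweighted column-wise majority with ties broken alphabetically, which is an assumption rather than a detail given in the text.

```python
import math


def log_likelihood(x, c, q):
    """log L(x | c), where q[j] is the error probability of x at position j."""
    total = 0.0
    for j, (a, b) in enumerate(zip(x, c)):
        # A mismatch contributes q[j]; a match contributes 1 - q[j]
        total += math.log(q[j]) if a != b else math.log1p(-q[j])
    return total


def consensus(kmers):
    """Column-wise majority string (ties broken alphabetically)."""
    return "".join(
        max(sorted(set(col)), key=col.count) for col in zip(*kmers)
    )
```

The log-likelihood of a whole subclustering is then the sum of `log_likelihood(x, c_i, q_x)` over all k-mers `x` in each subcluster with center `c_i`.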

In the subclustering procedure (see Algorithm 3), we sequentially subcluster each connected
component of the Hamming graph into more and more clusters with the classical k-means clustering
algorithm (denoted m-means since k has a different meaning). For the objective function, we use the
likelihood as above penalized for overfitting with the Bayesian information criterion (BIC) [13].
In this case, there are $|C|$ observations in the dataset, and the total number of parameters is
$3km + m - 1$:

• $m - 1$ for probabilities of subclusters,

• $km$ for cluster centers, and

• $2km$ for error probabilities in each letter: there are 3 possible errors for each letter, and
the probabilities should sum up to one. Here error probabilities are conditioned on the fact that
an error has occurred (alternatively, we could consider the entire distribution, including the
correct letter, and get $3km$ parameters for probabilities, but then there would be no need to
specify cluster centers, so the total number is the same).

Therefore, the resulting objective function is

$$\ell_m := 2 \cdot \log L_m(C_1, \ldots, C_m) - (3km + m - 1) \cdot \log |C|$$

for subclustering into $m$ clusters; we stop as soon as $\ell_m$ ceases to increase.

Algorithm 3 Bayesian subclustering

for all connected components $C$ of the Hamming graph do
    $m := 1$
    $\ell_1 := 2 \log L_1(C)$ (likelihood of the cluster generated by the consensus)
    repeat
        $m := m + 1$
        do m-means clustering of $C = C_1 \cup \ldots \cup C_m$ w.r.t. the Hamming distance;
            the initial approximation to the centers is given by k-mers that have the least
            error probability
        $\ell_m := 2 \cdot \log L_m(C_1, \ldots, C_m) - (3km + m - 1) \cdot \log |C|$
    until $\ell_m \le \ell_{m-1}$
    output the best found clustering $C = C_1 \cup \ldots \cup C_{m-1}$
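A compact sketch of the BIC-penalized subclustering loop is given below. It is a simplified illustration under several assumptions: the k-mers of a component are distinct, each comes with a list of per-position error probabilities strictly between 0 and 1, centers are unweighted consensus strings, and the m-means step is a plain Lloyd-style iteration with a fixed round limit. None of the function names come from BAYESHAMMER.

```python
import math


def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


def consensus(kmers):
    return "".join(max(sorted(set(col)), key=col.count) for col in zip(*kmers))


def log_lik(x, c, q):
    return sum(math.log(q[j]) if a != b else math.log1p(-q[j])
               for j, (a, b) in enumerate(zip(x, c)))


def m_means(kmers, centers, rounds=10):
    """Lloyd-style iterations with Hamming distance and consensus centers."""
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for x in kmers:
            nearest = min(range(len(centers)), key=lambda i: hamming(x, centers[i]))
            clusters[nearest].append(x)
        clusters = [cl for cl in clusters if cl]  # drop empty clusters
        new_centers = [consensus(cl) for cl in clusters]
        if new_centers == centers:
            break
        centers = new_centers
    return clusters, centers


def subcluster(kmers, quals):
    """Increase m while the BIC-penalized objective keeps growing."""
    k, n = len(kmers[0]), len(kmers)
    q_of = dict(zip(kmers, quals))

    def score(clusters, centers):
        m = len(clusters)
        ll = sum(log_lik(x, c, q_of[x])
                 for cl, c in zip(clusters, centers) for x in cl)
        return 2 * ll - (3 * k * m + m - 1) * math.log(n)

    # Seeds: k-mers with the least probability of containing an error
    by_error = sorted(kmers, key=lambda x: 1 - math.prod(1 - p for p in q_of[x]))
    best = ([list(kmers)], [consensus(kmers)])
    best_score = score(*best)
    for m in range(2, n + 1):
        candidate = m_means(kmers, by_error[:m])
        cand_score = score(*candidate)
        if cand_score <= best_score:
            break  # the objective ceased to increase
        best, best_score = candidate, cand_score
    return best
```

The penalty term grows by $(3k + 1) \log |C|$ with every extra subcluster, so a split is accepted only when it removes enough unlikely mismatches to outweigh that cost.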
Steps (4) and (5): selecting solid k-mers and expanding the set of solid k-mers

We define the quality of a k-mer $x$ as the probability that it is error-free:
$p_x = \prod_{j=0}^{k-1} (1 - q_x[j])$. The k-mer qualities are computed on Step (1) along with
computing k-mer statistics. Next, we (generously) define the quality of a cluster $C$ as the
probability that at least one k-mer in $C$ is correct:

$$p_C = 1 - \prod_{x \in C} (1 - p_x).$$

In contrast to HAMMER, we do not distinguish whether the cluster is a singleton (i.e., $|C| = 1$);
there may be plenty of superfluous clusters with several k-mers obtained by chance (actually, it is
more likely to obtain a cluster of several k-mers by chance than a singleton of the same total
multiplicity).

Initially we mark as solid the centers of the clusters whose total quality exceeds a predefined
threshold (a global parameter for BAYESHAMMER, set to be rather strict). Then we expand the set of
solid k-mers iteratively: if a read is completely covered by solid k-mers, we conclude that it
actually comes from the genome and mark all other k-mers in this read as solid, too (Algorithm 4).
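The two quality definitions and the expansion loop can be sketched as follows. The sketch rests on one interpretive assumption: a read counts as "completely covered" when every position of the read lies inside at least one solid k-mer. Function names are illustrative.

```python
import math


def kmer_quality(q):
    """p_x: probability that every base of the k-mer is correct."""
    return math.prod(1 - qj for qj in q)


def cluster_quality(kmer_qualities):
    """p_C: probability that at least one k-mer in the cluster is correct."""
    return 1 - math.prod(1 - p for p in kmer_qualities)


def expand_solid(reads, solid, k):
    """Iteratively mark all k-mers of fully covered reads as solid."""
    solid = set(solid)
    changed = True
    while changed:
        changed = False
        for read in reads:
            n = len(read)
            covered = [False] * n
            for i in range(n - k + 1):
                if read[i:i + k] in solid:
                    covered[i:i + k] = [True] * k
            if n >= k and all(covered):
                new = {read[i:i + k] for i in range(n - k + 1)} - solid
                if new:
                    solid |= new
                    changed = True
    return solid
```

The outer loop repeats because k-mers added from one read may complete the coverage of another read on the next pass.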
Step (6): reads correction

After Steps (1)-(5), we have constructed the set of solid k-mers that are presumably error-free.
To construct corrected reads from the set of solid k-mers, for each base of every read, we compute
the consensus of all solid k-mers and solid centers of clusters of all non-solid k-mers covering
this base (Figure 5). This step is formally described as Algorithm 5.

Algorithm 4 Solid k-mers expansion

procedure ITERATIVEEXPANSION($R$, $X$)
    repeat EXPANSIONSTEP($R$, $X$) until it returns FALSE

function EXPANSIONSTEP($R$, $X$)
    for all reads $r \in R$ do
        if $r$ is completely covered by solid k-mers then
            mark all k-mers in $r$ as solid
    return TRUE if $X$ has increased and FALSE otherwise

Algorithm 5 Reads correction

Input: reads $R$, solid k-mers $X$, clusters $\mathcal{C}$.

for all reads $r \in R$ do
    init consensus array $v : [0, |r| - 1] \times \{A, C, G, T\} \to \mathbb{N}$ with zeros:
        $v(j, a) := 0$ for all $j = 0, \ldots, |r| - 1$ and $a \in \{A, C, G, T\}$
    for $i = 0, \ldots, |r| - k$ do
        if $r[i, i + k - 1] \in X$ (it is solid) then
            for $j \in [i, i + k - 1]$ do
                $v(j, r[j]) := v(j, r[j]) + 1$
        if $r[i, i + k - 1] \in C$ for some $C \in \mathcal{C}$ then
            let $x$ be the center of $C$
            if $x \in X$ ($r[i, i + k - 1]$ belongs to a cluster with solid center) then
                for $j \in [i, i + k - 1]$ do
                    $v(j, x[j - i]) := v(j, x[j - i]) + 1$
    for $i \in [0, |r| - 1]$ do
        $r[i] := \arg\max_{a \in \Sigma} v(i, a)$.

Figure 5 Reads correction. Grey k-mers indicate non-solid k-mers. Red k-mers are the centers of the
corresponding clusters (two grey k-mers struck through on the right are non-solid singletons). As a
result, one nucleotide is changed.
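The per-base majority vote can be sketched as below. The sketch follows the prose description (solid k-mers vote for themselves, non-solid k-mers vote through a solid cluster center) and makes two assumptions that are not stated in the text: `center_of` is a hypothetical dictionary mapping each clustered k-mer to its center, and positions that receive no votes keep their original base.

```python
from collections import Counter


def correct_read(read, k, solid, center_of):
    """Per-base majority vote over solid k-mers and solid cluster centers."""
    votes = [Counter() for _ in read]
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in solid:
            voter = kmer                      # a solid k-mer votes for itself
        else:
            voter = center_of.get(kmer)       # a non-solid k-mer votes via its center
            if voter is None or voter not in solid:
                continue                      # no solid evidence for this window
        for j in range(k):
            votes[i + j][voter[j]] += 1
    return "".join(
        v.most_common(1)[0][0] if v else base
        for v, base in zip(votes, read)
    )
```

In the test case used for this sketch, the read ACTTAC with solid 3-mers ACG, CGT, GTA, TAC and centers assigned to its three non-solid 3-mers is corrected to ACGTAC, changing a single nucleotide.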

Results and discussion 

Datasets 

In our experiments, we used three datasets from [2]: a single-cell E. coli, a single-cell
S. aureus, and a standard (multi-cell) E. coli dataset. Paired-end libraries were generated by an
Illumina Genome Analyzer IIx from MDA-amplified single-cell DNA and from multi-cell genomic DNA
prepared from cultured E. coli, respectively. These datasets consist of 100 bp paired-end reads
with insert size 220; both E. coli datasets have average coverage ≈ 600×, although the coverage is
highly non-uniform in the single-cell case.

In all experiments, BAYESHAMMER used k = 21 (we observed no improvements for higher values of k).

k-mer counts

Table 1 shows error correction statistics produced by different tools on all three datasets. For a
comparison with HAMMER, we have emulated HAMMER with read correction by turning off Bayesian
subclustering (HammerExpanded in the table) and both Bayesian subclustering and read expansion,
another new idea of BAYESHAMMER (HammerNoExpansion in the table). Note that despite its more
complex processing, BAYESHAMMER is significantly faster than other error correction tools (except,
of course, for HAMMER, which is a strict subset of BAYESHAMMER processing in our experiments and
is run on BAYESHAMMER code). BAYESHAMMER also produces, in the single-cell case, a much smaller
set of k-mers in the resulting reads, which leads to smaller de Bruijn graphs and thus reduces the
total assembly running time. Since BAYESHAMMER trims only bad quality bases and does not, like
QUAKE, trim bases that it has not been able to correct (this has proven detrimental for
single-cell assembly in our experiments), it does produce a much larger set of k-mers than QUAKE
on a multi-cell dataset.

For a comparison of BAYESHAMMER with other tools in terms of error rate reduction across an
average read, see the logarithmic error rate graphs on Figure 6. Note that we are able to count
errors only for the reads that actually aligned to the genome, so the graphs are biased in this
way. Note how the first 21 bases are corrected better than others in BAYESHAMMER and both versions
of HAMMER since we have run it with k = 21; still, other values of k did not show a significant
improvement in either k-mer statistics or, more importantly, assembly results.

Table 1 k-mer statistics.

Correction tool     Running   Total        Genomic    Non-genomic  % of all genomic  % genomic among  % reads aligned
                    time      k-mers       k-mers     k-mers       k-mers found      all k-mers       to genome
                                                                   in reads          in reads

Multi-cell E. coli, total 4,543,849 genomic k-mers
Uncorrected         -         187,580,875  4,543,684  183,037,191  99.99             2.4              99.05
Quake               -         4,565,237    4,543,461  21,776       99.99             99.5             99.97
HammerNoExpansion   30 m      58,305,738   4,543,674  53,762,064   99.99             8.4              95.59
HammerExpanded      36 m      28,290,788   4,543,673  23,747,115   99.99             19.1             99.49
BayesHammer         37 m      27,100,305   4,543,674  22,556,631   99.99             20.1             99.62

Single-cell E. coli, total 4,543,849 genomic k-mers
Uncorrected         -         165,355,467  4,450,489  160,904,978  97.9              2.7              79.05
Camel               2 h 29 m  147,297,070  4,450,311  142,846,759  97.9              3.0              81.25
Euler-SR            2 h 15 m  138,677,818  4,450,431  134,227,387  97.9              3.2              81.95
Coral               2 h 47 m  156,907,496  4,449,560  152,457,936  97.9              2.8              80.28
HammerNoExpansion   37 m      53,001,778   4,443,538  48,558,240   97.8              8.3              81.36
HammerExpanded      43 m      36,471,268   4,443,545  32,027,723   97.8              12.1             86.91
BayesHammer         57 m      35,862,329   4,443,736  31,418,593   97.8              12.4             87.12

Single-cell S. aureus, total 2,821,095 genomic k-mers
Uncorrected         -         88,331,311   2,820,394  85,510,917   99.98             3.2              75.07
Camel               5 h 13 m  69,365,311   2,820,350  66,544,961   99.97             4.1              75.27
Euler-SR            2 h 33 m  58,886,372   2,820,349  56,066,023   99.97             4.8              75.24
Coral               7 h 12 m  83,249,146   2,820,011  80,429,135   99.96             3.4              75.22
HammerNoExpansion   58 m      37,465,296   2,820,341  34,644,955   99.97             7.5              71.63
HammerExpanded      1 h 03 m  23,197,521   2,820,316  20,377,205   99.97             12.1             76.54
BayesHammer         1 h 09 m  22,457,509   2,820,311  19,637,198   99.97             12.6             76.60

Assembly results 

Tables 2 and 3 show assembly results by the recently developed SPAdes assembler [10]; SPAdes was
designed specifically for single-cell assembly, but has by now demonstrated state-of-the-art
results on multi-cell datasets as well.

In the tables, N50 is the length such that contigs of that length or longer comprise at least half
of the assembly; NG50 is a metric similar to N50 but only taking into account contigs comprising
(and aligning to) the reference genome; NA50 is a metric similar to N50 after breaking up
misassembled contigs by their misassemblies. NGx and NAx metrics have a more direct relevance to
assembly quality than regular Nx metrics; our result tables have been produced by the recently
developed tool QUAST [14].
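As a small illustration of how these metrics are computed, the sketch below implements N50 directly from the definition above. For NG50 it uses the common convention of taking the reference genome length as the denominator instead of the assembly length; that convention is an assumption here, and the sketch does not perform the alignment step that QUAST uses for NA50 and NGA50.

```python
def nx(contig_lengths, total, fraction=0.5):
    """Largest length L such that contigs of length >= L sum to >= fraction * total.

    Returns 0 if the contigs never reach the required cumulative length.
    """
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0


def n50(contig_lengths):
    """N50 relative to the total assembly length."""
    return nx(contig_lengths, sum(contig_lengths))


def ng50(contig_lengths, reference_length):
    """NG50 relative to the reference genome length (assumed convention)."""
    return nx(contig_lengths, reference_length)
```

Because NG50 is normalized by the reference length, it does not reward an assembly for being shorter than the genome it is supposed to represent.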



Table 2 Assembly results, single-cell E. coli and S. aureus datasets (contigs of length > 200 are used).

Statistics                BayesHammer BayesHammer Coral       Coral       EulerSR     EulerSR     Hammer,     Hammer, no  Hammer, no  Hammer
                                      (scaffold)              (scaffold)              (scaffold)  expanded    expansion   expansion   (scaffold)
                                                                                                                          (scaffold)

Single-cell E. coli, reference length 4639675, reference GC content 50.79%
# contigs (> 1000 bp)     191         158         276         224         231         150         195         282         242         173
# contigs                 521         462         675         592         578         375         529         655         592         477
Largest contig            269177      284968      179022      179022      267676      267676      268464      210850      210850      268464
Total length              4952297     4989404     5064570     4817757     4817757     4902434     4977294     5097148     5340871     5005022
N50                       110539      113056      45672       67849       74139       95704       97639       65415       84893       109826
NG50                      112065      118432      55073       87317       77762       108976      101871      68595       96600       112161
NA50                      110539      113056      45672       67765       74139       95704       97639       65415       84841       109826
NGA50                     112064      118432      55073       87317       77762       108976      101871      68594       96361       112161
# misassemblies           4           6           9           12          6           8           4           4           7           7
# misassembled contigs    4           6           9           10          6           8           4           4           7           7
Misass. contigs length    42496       94172       62114       150232      47372       149639      43304       26872       147140      130706
Genome covered (%)        96.320      96.315      96.623      96.646      95.337      95.231      96.287      96.247      96.228      96.281
GC (%)                    49.70       49.69       49.61       49.56       49.90       49.74       49.68       49.64       49.60       49.68
# mismatches per 100 kbp  11.22       11.70       8.36        9.10        5.55        5.82        12.77       54.11       52.48       13.08
# indels per 100 kbp      1.07        8.26        9.17        12.76       0.52        47.80       0.91        1.17        7.96        8.69
# genes (complete)        4065        4079        3998        4040        3992        4020        4068        4034        4048        4078
# genes (partial)         124         110         180         143         140         107         123         152         136         111

Single-cell S. aureus, reference length 2872769, reference GC content 32.75%
# contigs (> 1000 bp)     95          85          132         113         82          70          114         272         258         101
Total length (> 1000 bp)  3019597     3309342     3055585     3066667     2972925     2993100     3033912     3389846     3405223     3509555
# contigs                 ?           ?           ?           ?           ?           ?           ?           ?           ?           ?
Largest contig            282558      328686      208166      208166      254085      535477      282558      148002      166053      328679
Total length              3081173     3368034     3160497     3166169     3008746     3020256     3111423     3575679     3591-168    3584266
N50                       87684       145466      62429       90701       101836      145466      74715       30788       34943       131272
NG50                      112566      194902      87636       99341       108151      159555      88292       39768       45889       180022
NA50                      87684       145466      62429       89365       100509      145466      68711       30788       34552       112801
NGA50                     88246       148064      74452       90101       101836      145466      88289       35998       42642       148023
# misassemblies           15          17          11          14          4           5           11          14          18          14
# misassembled contigs    12          14          9           10          4           5           9           14          16          12
Misass. contigs length    340603      779785      478009      523596      377133      918380      402997      272677      324361      940356
Genome covered (%)        99.522      99.483      99.449      99.447      99.213      99.254      99.204      98.820      98.888      99.221
GC (%)                    32.67       32.63       32.64       32.63       32.66       32.67       32.67       32.39       32.38       32.57
# mismatches per 100 kbp  3.18        8.01        12.44       12.65       9.72        10.28       17.38       54.92       55.50       15.36
# indels per 100 kbp      2.17        2.30        15.50       15.67       3.80        4.08        3.57        2.64        2.72        3.04
# genes (complete)        2540        2547        2532        2540        2547        2550        2535        2477        2485        2539
# genes (partial)         36          30          45          37          30          27          41          91          85          38



Table 3 Assembly results, multi-cell E. coli dataset (contigs of length > 200 are used).

Statistics                BayesHammer BayesHammer Hammer,     Hammer, no  Hammer, no  Hammer      Quake
                                      (scaffold)  expanded    expansion   expansion   (scaffold)
                                                                          (scaffold)

Multi-cell E. coli, 600× coverage, reference length 4639675, reference GC content 50.79%
# contigs (> 500 bp)      103         102         119         238         213         115         165
# contigs (> 1000 bp)     91          90          99          192         171         96          156
Total length (> 500 bp)   4641845     4641790     4626515     4730338     4817457     4627067     4543682
Total length (> 1000 bp)  4633361     4633306     4611745     4696966     4787210     4612838     4537565
# contigs                 122         121         146         325         303         141         204
Largest contig            285113      285113      218217      210240      210240      218217      165487
Total length              4647325     4647270     4635156     4756088     4844208     4635349     4555015
N50                       132645      132645      113608      59167       73113       113608      58777
NG50                      132645      132645      113608      59669       80085       113608      57174
NA50                      132645      132645      113608      59167       73113       113608      58777
NGA50                     132645      132645      113608      59669       80085       113608      57174
# misassemblies           3           3           4           4           7           5           0
# misassembled contigs    3           3           4           4           7           5           0
Misass. contigs length    44466       44466       57908       15259       30901       60418       0
Genome covered (%)        99.440      99.440      99.383      98.891      98.925      99.385      98.747
GC (%)                    50.78       50.77       50.77       50.73       50.71       50.77       50.75
N's (%)                   0.00000     0.00000     0.00000     0.00000     0.00000     0.00000     0.00000
# mismatches per 100 kbp  8.55        8.55        13.76       44.46       44.33       13.76       1.21
# indels per 100 kbp      0.99        0.99        1.14        0.76        0.97        1.14        0.20
# genes (complete)        4254        4254        4245        4196        4204        4245        4174
# genes (partial)         45          45          56          72          58          56          62



All assemblies have been done with SPADES. The results show that after BAYESHAMMER correction,
assembly results improve significantly, especially in the single-cell E. coli case. It is
especially interesting to note that even in the multi-cell case, where BAYESHAMMER loses to QUAKE
by k-mer statistics, assembly results actually improve over assemblies produced from
QUAKE-corrected reads (including genome coverage and the number of genes).

Conclusions 

Single-cell sequencing presents novel challenges to error correction tools. In contrast to
multi-cell datasets, for single-cell datasets there is no well-behaved distribution of k-mer
multiplicities; one therefore has to work with k-mers on a one-by-one basis, considering each
cluster of k-mers separately. In this work, we further developed the ideas of HAMMER from a
Bayesian clustering perspective and presented a new tool, BAYESHAMMER, that makes them practical
and yields significant improvements over existing error correction tools.

There is further work to be done to make our underlying models closer to real life; for instance,
one could learn a non-uniform distribution of single nucleotide errors and plug it into our
likelihood formulas. Another natural improvement would be to try to rid the results of
contamination by either human or some other DNA material; we observed significant human DNA
contamination in our single-cell dataset, so weeding it out might yield a significant improvement.
Finally, a new general approach that we are going to try in our further work deals with the
technique of minimizers introduced by Roberts et al. [15]. It may provide a significant reduction
in memory requirements and a possible approach to dealing with paired information.